ECPR


Machine Learning with Big Data for Social Scientists

Member rate £492.50
Non-Member rate £985.00

Save £45 Loyalty discount applied automatically*
Save 5% on each additional course booked

*If you attended our Methods School in July/August 2023 or February 2024.

Course Dates and Times

Monday 16 to Friday 20 August 2021
2 hours of live teaching per day
14:00 - 16:00 CET

Akitaka Matsuo

a.matsuo@essex.ac.uk

University of Essex

This course provides a highly interactive online teaching and learning environment, using state-of-the-art online pedagogical tools. It is designed for a demanding audience (researchers, professional analysts, advanced students) and capped at a maximum of 16 participants so that the teaching team can cater to the specific needs of each individual.

Purpose of the course

You will learn how to work with big data using R, through various available solutions.

You’ll also gain insights from the data through basic machine learning techniques, and from coding tutorials.

ECTS Credits

3 credits: Engage fully with class activities
4 credits: Complete a post-class assignment


Instructor Bio

Akitaka is a Postdoctoral Research Fellow at the Institute for Analytics and Data Science (IADS). Before joining IADS, he was a Research Fellow in Data Science in LSE's Department of Methodology. He earned his PhD in political science at Rice University in Houston. 

His research interests lie in data science and politics, in particular in the statistical methodology for scaling political behaviour, and natural language processing of political texts.

Twitter @amatsuo_net

The course has two core topics:
  • constructing and managing large datasets in R
  • machine learning, with specific focus on providing analytics from large datasets. 

For the first topic of big data, we start by asking: What is big data? Why is it difficult to work with? We then learn the best solutions in R for working with big data depending on the size of the data, from locally stored data objects to databases hosted in the cloud.

For the second topic of machine learning, we will learn basic concepts such as:

  1. problem definitions
  2. objective function
  3. bias-variance tradeoffs
  4. parameter tuning. 

Social scientists have traditionally emphasised explanation as the primary purpose of statistical analysis. Machine learning has an overlapping but clearly different orientation. By contrasting the inference-based approach with the prediction-focused approach, you will come to understand the fundamental ideas of machine learning. The application of machine learning techniques to various analytical tasks in the social sciences will follow these theoretical discussions.

The daily topics are listed below.

Day 1

Data Management in R
Discussions on big data: what it is and why it is difficult to work with. We cover setting up computation environments in the cloud, introduce cloud computing concepts, and discuss why the shift to cloud computing is occurring. To conclude Day 1, you will work with large data in R, moving from R objects (data frame, tibble, etc.) to databases and distributed-computing frameworks.

Day 2

Databases & Machine Learning Basics
Learn the methods for handling big data using various types of databases, which offload the data from the R workspace while keeping it highly accessible. Explore the basics of relational databases, such as concepts, rules, and design, followed by the syntax of SQL queries. Participate in general discussions on the logic of machine learning to understand the basics.
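As a rough illustration of keeping data in a database and pulling only a summary into R, here is a minimal sketch. It assumes the DBI and RSQLite packages are installed. The in-memory SQLite database, the `survey` table, and its columns are made up for this example and are not course material.

```r
# Illustrative sketch only: assumes the DBI and RSQLite packages are installed;
# the table and column names are made up.
library(DBI)

# An in-memory SQLite database stands in for a local or cloud-hosted database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

set.seed(7)
survey <- data.frame(
  respondent_id = 1:1000,
  country = sample(c("DE", "FR", "UK"), 1000, replace = TRUE),
  trust_score = round(runif(1000, min = 0, max = 10), 1)
)
dbWriteTable(con, "survey", survey)
rm(survey)  # the data now lives in the database, not in the R workspace

# Let the database do the aggregation and return only the small summary.
country_summary <- dbGetQuery(con, "
  SELECT country,
         COUNT(*)         AS n_respondents,
         AVG(trust_score) AS mean_trust
  FROM survey
  GROUP BY country
  ORDER BY country
")
print(country_summary)

dbDisconnect(con)
```

Only the three-row summary is returned to R. With a remote database the same query would run on the server, which is what makes the approach useful when the full table does not fit in memory.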

Day 3

Regression
Revisiting linear regression, learn how to interpret the regression problem in the machine learning framework, covering cases where the outcome is a continuous quantity. Explore the issue of variable selection, which frequently arises with big data that has numerous input features. Learn about typical shrinkage methods such as Ridge regression and LASSO. The concept of resampling methods will also be introduced.
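As a rough illustration of how shrinkage and resampling fit together, the following base R sketch fits Ridge regression using its closed-form solution and chooses the penalty by k-fold cross-validation. The simulated data, the helper names `ridge_fit` and `cv_mse`, and the grid of penalty values are all assumptions made for this example, not course material. In practice a dedicated package such as glmnet would likely be used instead.

```r
# Illustrative sketch only: simulated data and helper names are assumptions.
set.seed(42)

n <- 200   # observations
p <- 20    # input features, most of them irrelevant
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
beta_true <- c(2, -1.5, 1, rep(0, p - 3))
y <- as.vector(X %*% beta_true + rnorm(n))

# Closed-form ridge estimate on centred data: (X'X + lambda * I)^(-1) X'y.
# Centring removes the intercept so that it is not penalised.
ridge_fit <- function(X, y, lambda) {
  x_means <- colMeans(X)
  y_mean <- mean(y)
  Xc <- sweep(X, 2, x_means)
  beta <- solve(crossprod(Xc) + lambda * diag(ncol(X)),
                crossprod(Xc, y - y_mean))
  beta <- as.vector(beta)
  list(beta = beta, intercept = y_mean - sum(x_means * beta))
}

# k-fold cross-validated mean squared error for one value of lambda.
# `folds` assigns each row to a fold; reusing it keeps comparisons fair.
cv_mse <- function(X, y, lambda, folds) {
  fold_mse <- sapply(sort(unique(folds)), function(i) {
    held_out <- folds == i
    fit <- ridge_fit(X[!held_out, , drop = FALSE], y[!held_out], lambda)
    pred <- fit$intercept + X[held_out, , drop = FALSE] %*% fit$beta
    mean((y[held_out] - pred)^2)
  })
  mean(fold_mse)
}

folds <- sample(rep(1:5, length.out = n))   # 5-fold assignment
lambdas <- c(0.01, 0.1, 1, 10, 100, 1000)
cv_errors <- sapply(lambdas, function(l) cv_mse(X, y, l, folds))
best_lambda <- lambdas[which.min(cv_errors)]

print(data.frame(lambda = lambdas, cv_mse = cv_errors))
print(best_lambda)
```

Very large penalties shrink all coefficients towards zero and raise the held-out error, while very small penalties behave like ordinary least squares. Cross-validation is one way to choose a value between these extremes.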

Day 4

Classification Methods 1
Moving to cases where the outcome is categorical, you will participate in discussions on supervised classification, where machine learning algorithms are applied to predict known values of outputs. Using a classification method familiar to social scientists, learn how to evaluate models in the machine learning framework. A more detailed discussion of model performance evaluation will be provided.
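To make the idea of evaluating a classifier on held-out data concrete, here is a minimal base R sketch. It assumes logistic regression as the familiar classification method, which the description above does not specify. The simulated data, the 70/30 split, and the 0.5 cutoff are likewise illustrative choices.

```r
# Illustrative sketch only: simulated data; logistic regression is an assumed
# stand-in for "a classification method familiar to social scientists".
set.seed(1)

n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
prob <- plogis(-0.5 + 1.5 * x1 - 1 * x2)
dat <- data.frame(y = rbinom(n, size = 1, prob = prob), x1 = x1, x2 = x2)

# Hold out 30% of the rows so performance is measured on unseen data.
test_idx <- sample(n, size = round(0.3 * n))
train <- dat[-test_idx, ]
test <- dat[test_idx, ]

fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# Predicted probabilities on the test set, turned into labels at a 0.5 cutoff.
pred_prob <- predict(fit, newdata = test, type = "response")
pred_label <- as.integer(pred_prob > 0.5)

# Confusion matrix and summary metrics.
confusion <- table(
  actual = factor(test$y, levels = c(0, 1)),
  predicted = factor(pred_label, levels = c(0, 1))
)
accuracy <- sum(diag(confusion)) / sum(confusion)
precision <- confusion["1", "1"] / sum(confusion[, "1"])
recall <- confusion["1", "1"] / sum(confusion["1", ])

print(confusion)
print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```

The model is judged by how well it predicts observations it has not seen, rather than by the significance of its coefficients. This is the shift in emphasis from inference to prediction.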

Day 5

Classification Methods 2
Continuing from Day 4, the focus is on various tree-based methods, starting with a simple tree and moving on to more sophisticated methods such as random forest and boosting. To finish, we will revisit the issue of data size and get an overview of the methodology for distributed computing, which makes it possible to gain insights from big data as a whole.

How the course will work online

The course provides approximately 5 hours of pre-recorded lectures as well as an online forum on Slack where the Instructor and students can freely discuss the lecture materials and coding. 

Approximately two hours of each day will be an online seminar, where we will learn how to apply the concepts and knowledge gained from the pre-course lecture materials through Q&A and live lab work.

In the live lab, you will be given several coding tasks, and asked to code along with the Instructor. Some tasks are left as homework, which will be discussed on the online forum and during the following day’s live lab. 

You’ll also be able to book one-to-one consultations with the Instructor in advance, during scheduled office hours.

You will learn how to work in a cloud computing workspace, using R as the primary statistical software with RStudio Server and Google Colab. Example scripts and assignments are distributed through GitHub so you can learn how to use online version control systems for collaboration and research accountability.

The course assumes you have some familiarity with the R statistical language and can conduct basic data handling in R (opening data files, working with data frames). If you don’t have this, take the week-one course Introduction to R.

You should also have basic knowledge of standard statistical analysis in social science, such as linear regression and hypothesis testing. These are covered in the courses Introduction to Inferential Statistics and Big Data Collection and Management in R.

The pre-course work consists of two activities, which are to be completed before Day 1.

  • Reading: mostly textbook-type materials, approximately two hours for each day of the course (10 hours in total).
  • Lectures: pre-recorded videos accompanied by supporting materials, with a one-hour video for each day of the course (5 hours in total).

Topics by day

Day 1: Cloud Computing and Handling Big Data in R
  • Session 1: Big data in social sciences
  • Session 2: R environment for big data analytics

Day 2: Databases in Local and Cloud
  • Session 1: Database Basics, SQL Syntax
  • Session 2: Summarising Big Data in R

Day 3: Machine Learning Fundamentals
  • Session 1: Fundamentals of machine learning
  • Session 2: Resampling methods

Day 4: Regression problems
  • Session 1: Linear Regression in machine learning
  • Session 2: Regularisation based methods (LASSO and Ridge Regression)

Day 5: Classification problems
  • Session 1: Supervised classification
  • Session 2: Clustering with Big Data

Readings by day

Day 1
Simon Walkowiak (2016). Big Data Analytics with R. Packt Publishing. Chapters 1 and 3 (optionally Chapter 2).

Day 2
Simon Walkowiak (2016). Big Data Analytics with R. Packt Publishing. Chapter 5.

Day 3
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2014). An Introduction to Statistical Learning with Applications in R. Springer. Chapters 2 and 5.

Day 4
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2014). An Introduction to Statistical Learning with Applications in R. Springer. Chapters 3 and 6.

Day 5
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2014). An Introduction to Statistical Learning with Applications in R. Springer. Chapters 4, 8.1–2 and 10.3.